The previous GenerateProbs and GenerateProbsPart2 focused on creating a CDF that can be indexed with a uniform random number to determine the coupon draw. But what if we went the other way? Each coupon occupies 1/N of the space, but instead of a uniform random number, we draw a random variate from some distribution (probably beta), and where that variate lands determines which coupon gets drawn. It's way easier to set up, but there may be a slowdown because we now draw random variates instead of uniform random numbers. Let's time a few options for generating beta-distributed numbers.
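To make that mapping concrete before timing anything, here's a rough sketch (not one of the timed cells below; the alpha/beta values are arbitrary) of how a single beta variate picks a coupon: build the evenly spaced CDF edges and count how many of them sit above the variate.
# Sketch: map one beta variate to a coupon label (each of the n coupons owns 1/n of [0, 1])
import numpy as np
n = 10
cdf = (np.arange(n) + 1.0) / n   # bin edges 0.1, 0.2, ..., 1.0
rv = np.random.beta(2.0, 5.0)    # one beta variate in (0, 1); shape parameters chosen arbitrarily
draw = (cdf > rv).sum()          # edges above rv; labels run from 1 (rv near 1) to n (rv near 0)
print(rv, draw)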
In [1]:
%matplotlib inline
import numpy as np
from numpy.random import beta as npbeta
from random import betavariate as pybeta
from scipy.stats import beta as scibeta
from matplotlib import pyplot as plt
from numpy import arange, vectorize
import timeit
In [2]:
start = timeit.default_timer()
for i in np.arange(1000000):
    t = np.random.rand()
et = timeit.default_timer() - start
print(et)
In [3]:
start = timeit.default_timer()
for i in np.arange(1000000):
    t = npbeta(1,1)
et = timeit.default_timer() - start
print(et)
In [4]:
start = timeit.default_timer()
for i in np.arange(1000000):
    t = pybeta(1,1)
et = timeit.default_timer() - start
print(et)
In [5]:
start = timeit.default_timer()
for i in np.arange(1000000):
    t = scibeta.rvs(1,1)
et = timeit.default_timer() - start
print(et)
It looks like using random variates will make the sim run a little slower, but not by a crazy amount if we use numpy's variate generation. Stay away from pure Python and scipy's version, though; they'll kill the speed.
Now that we are drawing coupons with random variates instead of building a custom probability CDF, we can simulate the CCP with a single function:
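One thing worth noting (not timed above, and not something the sim below relies on): numpy can also generate variates in bulk with a single call, which is usually far cheaper per variate than calling npbeta once per draw. The catch is that the simulation stops as soon as every coupon has been seen, so pre-drawing a fixed block would force us to guess a maximum number of draws.
# Sketch: pre-draw a block of beta variates in one vectorized numpy call
start = timeit.default_timer()
t = npbeta(1, 1, size=1000000)   # one call returns an array of 1,000,000 variates
et = timeit.default_timer() - start
print(et)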
In [6]:
def single_run_looped(n, dist, alpha, beta):
    """
    This is a single run of the CCP.
    n = number of unique coupons
    dist = name of the draw distribution (unused in this version; only beta is drawn)
    alpha, beta = shape parameters of the beta distribution that drives each draw
    """
    m = 0 #start at zero draws
    cdf = (arange(n)+1.0)/n #equally spaced bin edges: each coupon owns 1/n of the space
    draws = [] #create our draw array
    uniques = [] #create our unique array (deque is faster but may break DB inserts - research)
    unique = 0
    while True:
        m += 1 #increment our draw counter
        rv = npbeta(alpha, beta) #randomness that decides which coupon to draw
        draw = (cdf > rv).sum() #count of edges above rv maps the variate to a coupon label in 1..n
        if draw not in draws:
            draws.append(draw)
            unique += 1
        uniques.append(unique) #store the info
        if unique == n: #we'll stop once we have drawn all our coupons
            return m #this line returns the number of draws; for testing
            #return uniques #this line returns the full unique draws list; the actual data we want to record

vectorized_single_run_looped = vectorize(single_run_looped)
In [7]:
start = timeit.default_timer()
#test our sim with known results on uniform probs
trials = 200000
n = 10
records = vectorized_single_run_looped([n]*trials, ['beta']*trials, [1.0]*trials, [1.0]*trials)
num_fails = np.where(records==0)
average = np.mean(records)
std_error = np.std(records)/np.sqrt(trials)
z_crit = 1.96
low_ci = average - z_crit*std_error
high_ci = average + z_crit*std_error
expectation = n*np.asarray([1.0/k for k in np.arange(1, n+1)]).sum() #classic coupon collector expectation: n * H_n
et = timeit.default_timer()-start
print ("num_fails: ", len(num_fails[0]))
print("low_ci: ", low_ci, "point_est: ", average, "high_ci: ", high_ci)
print("expected value: ", expectation)
print("elapsed_time: ", et)
So the version with random variates is ~10% slower, but that is a small price to pay for a coupon-drawing system that "makes sense" and doesn't have pesky scaling and weird math to make sure the probabilities come out right. Plus, we can use the beta distribution, which is almost as nifty statistically as the normal distribution, and we can easily iterate over alpha and beta values. Now all we need to do is figure out a way to store all this data and create a table with it.
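One possible shape for that alpha/beta sweep, just as a sketch (the grid values, trial count, and row layout below are placeholders, not the final table design): loop over a small grid of shape parameters, run a batch of trials for each pair, and keep one summary row per pair in a list that can later be inserted into a database or DataFrame.
# Sketch: sweep a small grid of (alpha, beta) pairs and collect one summary row per pair
# (grid values, trial count, and column order are placeholders)
rows = []
trials = 1000
n = 10
for a in [0.5, 1.0, 2.0]:
    for b in [0.5, 1.0, 2.0]:
        recs = vectorized_single_run_looped([n]*trials, ['beta']*trials, [a]*trials, [b]*trials)
        rows.append((n, a, b, np.mean(recs), np.std(recs)/np.sqrt(trials)))
for row in rows:
    print(row)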